Appendix A: Datasets

The exercises can be done with different datasets using roughly the same operations. Each chapter opens with an empirical example that illustrates how the analysis is done in R. What you need to work out is what each part of the code does, beyond the fairly brief explanations in the text.

A suggested way of working is to first reproduce the analyses in the example and make sure you understand how things work. Then pick another dataset (from those listed below) and carry out a corresponding analysis on it. After that you can continue with further datasets as time allows. Doing several analyses on different datasets is good preparation for the exam.

Remember that each dataset comes with its own relevant research questions, which affect how you carry out the analysis and interpret the results. So do not take the written questions lightly!

A.1 Reading in the datasets

The downloadable versions of these datasets differ slightly in format from those used in the example code. Files in rds format (file name ending in .rds) are read with the function readRDS() and should work as they are.

Files in csv format (file name ending in .csv) are text files in which the columns are separated by a delimiter.[1] The delimiter is typically a comma, but in the downloadable files it has become a semicolon. As a result, reading the file with read.csv() does not work as expected. The solution is either to state the delimiter explicitly with sep = ";" or to use the function read.csv2() instead of read.csv(). Here is example code:

attrition <- read.csv("data/attrition.csv", sep = ";")   # States the delimiter explicitly

attrition <- read.csv2("data/attrition.csv")             # Uses a variant of the same function whose default delimiter is a semicolon
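
To see why the delimiter matters, the following minimal sketch writes a small semicolon-separated file to a temporary location and reads it back in three ways. The file contents and the object names (`tmp`, `wrong`, `right1`, `right2`) are made up for illustration and are not part of the course data.

```r
# Write a tiny semicolon-separated file to a temporary location (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;age;left", "1;41;Yes", "2;49;No"), tmp)

# With the default comma delimiter, each line ends up in one single column
wrong <- read.csv(tmp)
ncol(wrong)

# Stating the delimiter explicitly gives the expected three columns
right1 <- read.csv(tmp, sep = ";")
ncol(right1)

# read.csv2() has semicolon as its default delimiter (and comma as decimal mark)
right2 <- read.csv2(tmp)

# For this file the two approaches give the same data frame
identical(right1, right2)
```

Note that read.csv2() also expects a comma as the decimal mark, so the two approaches can differ for files that contain decimal numbers written with a point.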

All the datasets have now also been made available in Canvas as two zip folders: 1) the data in the format used when they were read in, and 2) all of them in rds format.

In these exercises we will use several different datasets. Download them and put them in your data folder.

A.2 Credit

Outcome variable: “default” (defaulting on a loan). The data are taken from datacamp.com.

The data contain the following variables:

credit <- read.csv("data/credit.csv")

glimpse(credit)
Rows: 1,000
Columns: 17
$ checking_balance     <chr> "< 0 DM", "1 - 200 DM", "unknown", "< 0 DM", "< 0…
$ months_loan_duration <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12…
$ credit_history       <chr> "critical", "good", "critical", "good", "poor", "…
$ purpose              <chr> "furniture/appliances", "furniture/appliances", "…
$ amount               <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3…
$ savings_balance      <chr> "unknown", "< 100 DM", "< 100 DM", "< 100 DM", "<…
$ employment_duration  <chr> "> 7 years", "1 - 4 years", "4 - 7 years", "4 - 7…
$ percent_of_income    <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4…
$ years_at_residence   <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4…
$ age                  <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 2…
$ other_credit         <chr> "none", "none", "none", "none", "none", "none", "…
$ housing              <chr> "own", "own", "own", "other", "other", "other", "…
$ existing_loans_count <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2…
$ job                  <chr> "skilled", "skilled", "unskilled", "skilled", "sk…
$ dependents           <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ phone                <chr> "yes", "no", "no", "no", "no", "yes", "no", "yes"…
$ default              <chr> "no", "yes", "no", "no", "yes", "no", "no", "no",…

A.3 Attrition

Outcome variable: “Attrition”, i.e. whether an employee leaves the job.

The dataset is available from Kaggle.

attrition <- readRDS("data/attrition.rds")
glimpse(attrition)
Rows: 1,470
Columns: 32
$ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
$ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,…
$ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Travel_Rare…
$ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
$ Department               <fct> Sales, Research & Development, Research & Dev…
$ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
$ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
$ EducationField           <fct> Life Sciences, Life Sciences, Other, Life Sci…
$ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
$ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
$ Gender                   <fct> Female, Male, Male, Female, Male, Male, Femal…
$ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
$ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
$ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
$ JobRole                  <fct> Sales Executive, Research Scientist, Laborato…
$ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
$ MaritalStatus            <fct> Single, Married, Single, Married, Married, Si…
$ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
$ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
$ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
$ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
$ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
$ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
$ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
$ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
$ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
$ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
$ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
$ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
$ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
$ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
$ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

A.4 Municipality data

These data are taken from the official statistics of Statistics Norway (SSB) and linked by municipality number. They come from StatBank tables no. 06944 (income), 12210 (social assistance/KOSTRA), 07459 (population) and 08487 (reported offences). More variables can be linked in. Note: there have been several changes to the municipal structure, particularly in 2020, so the municipalities are not exactly the same units over time.

Possible outcome variables: several variables could serve as the outcome. The predictors will probably need some reworking at your own discretion (e.g. converting to rates per 1,000 or percentages, summing to totals, etc.).

kommune <- readRDS("data/kommunedata.rds")
glimpse(kommune)
Rows: 1,529
Columns: 28
$ kommune_nr             <chr> "0101", "0101", "0101", "0101", "0104", "0104",…
$ kommune                <chr> "Halden (-2019)", "Halden (-2019)", "Halden (-2…
$ year                   <dbl> 2015, 2016, 2017, 2018, 2015, 2016, 2017, 2018,…
$ bef_18min              <int> 3556, 3503, 3505, 3544, 3594, 3652, 3704, 3655,…
$ bef_18_25              <int> 3575, 3585, 3432, 3438, 3405, 3404, 3355, 3370,…
$ bef_26_35              <int> 3728, 3804, 3985, 4035, 4057, 4071, 4124, 4110,…
$ bef_totalt             <int> 30328, 30544, 30790, 31037, 31802, 32182, 32407…
$ menn_18_25             <int> 1847, 1865, 1813, 1819, 1789, 1802, 1789, 1810,…
$ menn_26_35             <int> 1880, 1919, 2005, 2062, 2063, 2083, 2113, 2134,…
$ menn_36_67             <int> 7067, 7051, 7085, 7057, 7418, 7453, 7408, 7407,…
$ menn_67plus            <int> 2496, 2624, 2697, 2806, 2671, 2777, 2856, 2895,…
$ menn_18min             <int> 1880, 1847, 1873, 1876, 1842, 1885, 1919, 1878,…
$ kvinner_18_25          <int> 1728, 1720, 1619, 1619, 1616, 1602, 1566, 1560,…
$ kvinner_26_35          <int> 1848, 1885, 1980, 1973, 1994, 1988, 2011, 1976,…
$ kvinner_36_67          <int> 6880, 6832, 6844, 6848, 7479, 7519, 7537, 7596,…
$ kvinner_67plus         <int> 3026, 3145, 3242, 3309, 3178, 3306, 3423, 3555,…
$ kvinner_18min          <int> 1676, 1656, 1632, 1668, 1752, 1767, 1785, 1777,…
$ inntekt_totalt_median  <int> 555000, 562000, 580000, 591000, 561000, 568000,…
$ inntekt_eskatt_median  <int> 451000, 453000, 470000, 480000, 449000, 456000,…
$ ant_husholdninger      <int> 13890, 14124, 14281, 14454, 15046, 15132, 15313…
$ shj_klienter           <int> 1183, 1137, 1099, 1128, 1155, 1129, 1152, 1137,…
$ shj_unge               <int> 262, 247, 248, 242, 267, 263, 238, 222, 307, 28…
$ vinningskriminalitet   <dbl> 19.7, 18.7, 16.5, 14.5, 24.5, 21.5, 18.0, 18.0,…
$ voldskriminalitet      <dbl> 11.2, 12.6, 12.3, 11.2, 7.8, 8.3, 8.7, 9.7, 6.8…
$ nark_alko_kriminalitet <dbl> 21.0, 21.9, 21.0, 20.3, 12.0, 10.2, 10.9, 10.1,…
$ ordenslovbrudd         <dbl> 18.5, 16.5, 14.9, 13.7, 8.9, 9.0, 9.1, 9.2, 8.2…
$ trafikklovbrudd        <dbl> 15.5, 16.3, 16.7, 19.2, 7.4, 6.3, 6.9, 8.0, 9.6…
$ andre_lovbrudd         <dbl> 25.5, 26.5, 26.1, 25.2, 12.1, 12.2, 11.9, 12.4,…
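
As a sketch of the kind of reworking mentioned above, the code below converts counts to a rate per 1,000 inhabitants and to a percentage. It uses a small hand-typed data frame with the first four values shown in the output above so that it runs on its own. The new column names (`shj_per_1000`, `andel_18_25`) are made up for illustration, and reading `shj_klienter` as the number of social assistance clients is an assumption based on the variable name.

```r
# Hand-typed excerpt of the first four rows shown above (kommune_nr 0101, 2015-2018)
kommune_ex <- data.frame(
  kommune_nr   = "0101",
  year         = 2015:2018,
  bef_totalt   = c(30328, 30544, 30790, 31037),
  bef_18_25    = c(3575, 3585, 3432, 3438),
  shj_klienter = c(1183, 1137, 1099, 1128)
)

# Clients per 1,000 inhabitants (assumes shj_klienter is a count of clients)
kommune_ex$shj_per_1000 <- kommune_ex$shj_klienter / kommune_ex$bef_totalt * 1000

# Share of the population aged 18-25, in percent
kommune_ex$andel_18_25 <- kommune_ex$bef_18_25 / kommune_ex$bef_totalt * 100

# Show the derived columns rounded to one decimal
round(kommune_ex[, c("year", "shj_per_1000", "andel_18_25")], 1)
```

The same pattern can be applied to the full `kommune` data frame once it has been read in.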

A.5 Churn

churn <- read.csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")
glimpse(churn)
Rows: 7,043
Columns: 21
$ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
$ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
$ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
$ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
$ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
$ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
$ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
$ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
$ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
$ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
$ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
$ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
$ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
$ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
$ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
$ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
$ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
$ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
$ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
$ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…

A.6 Recidivism from Iowa prisons

The dataset contains data on 26,020 people released from prison in the state of Iowa, USA, between 2010 and 2015. For each person there is information on whether they were imprisoned again within 3 years (i.e. followed up until between 2013 and 2018).

Possible outcome variable: “Recidivism...Return.to.Prison.numeric”. Feel free to rename variables to something shorter.

The dataset is available from Kaggle and is described in more detail there.

recidivism <- read.csv("data/3-Year_Recidivism_for_Offenders_Released_from_Prison_in_Iowa_elaborated.csv", stringsAsFactors = TRUE)

glimpse(recidivism)
Rows: 26,020
Columns: 12
$ Fiscal.Year.Released                      <int> 2010, 2010, 2010, 2010, 2010…
$ Recidivism.Reporting.Year                 <int> 2013, 2013, 2013, 2013, 2013…
$ Race...Ethnicity                          <fct> White - Non-Hispanic, White …
$ Age.At.Release                            <fct> Under 25, 55 and Older, 25-3…
$ Convicting.Offense.Classification         <fct> D Felony, D Felony, D Felony…
$ Convicting.Offense.Type                   <fct> Violent, Public Order, Prope…
$ Convicting.Offense.Subtype                <fct> Assault, OWI, Burglary, Traf…
$ Main.Supervising.District                 <fct> 4JD, 7JD, 5JD, 8JD, 3JD, , 3…
$ Release.Type                              <fct> Parole, Parole, Parole, Paro…
$ Release.type..Paroled.to.Detainder.united <fct> Parole, Parole, Parole, Paro…
$ Part.of.Target.Population                 <fct> Yes, Yes, Yes, Yes, Yes, No,…
$ Recidivism...Return.to.Prison.numeric     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
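
One way to shorten the variable names is sketched here in base R. The toy data frame only mimics two of the columns shown above, and the new names (`returned`, `race`) are arbitrary choices for illustration.

```r
# Toy data frame with two of the long column names from the output above
recidivism_ex <- data.frame(
  Race...Ethnicity = c("White - Non-Hispanic", "White - Non-Hispanic"),
  Recidivism...Return.to.Prison.numeric = c(1L, 1L)
)

# Replace each long name with a shorter one by matching on the old name
names(recidivism_ex)[names(recidivism_ex) == "Recidivism...Return.to.Prison.numeric"] <- "returned"
names(recidivism_ex)[names(recidivism_ex) == "Race...Ethnicity"] <- "race"

names(recidivism_ex)
```

Matching on the old name rather than on the column position keeps the code correct even if the column order changes.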

A.7 Compas

Outcome variable: “Two_yr_Recidivism”

The data are taken from the R package fairmodels, a modified dataset from ProPublica.

compas <- readRDS("data/compas.rds")
glimpse(compas)
Rows: 6,172
Columns: 7
$ Two_yr_Recidivism    <fct> 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1…
$ Number_of_Priors     <int> 0, 0, 4, 0, 14, 3, 0, 0, 3, 0, 0, 1, 7, 0, 3, 6, …
$ Age_Above_FourtyFive <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
$ Age_Below_TwentyFive <fct> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ Misdemeanor          <fct> 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0…
$ Ethnicity            <fct> Other, African_American, African_American, Other,…
$ Sex                  <fct> Male, Male, Male, Male, Male, Male, Female, Male,…

A.8 Diabetes rehospitalization

The data are described in more detail in Strack et al. (2014) (see Table 1 in particular) and are available from the UCI Machine Learning Repository.

The outcome variable of interest is readmitted, i.e. whether the patient is admitted again at some point after discharge.

diabetic <- read.csv("data/diabetic_data.csv")
glimpse(diabetic)
Rows: 101,766
Columns: 50
$ encounter_id             <int> 2278392, 149190, 64410, 500364, 16680, 35754,…
$ patient_nbr              <int> 8222157, 55629189, 86047875, 82442376, 425192…
$ race                     <chr> "Caucasian", "Caucasian", "AfricanAmerican", …
$ gender                   <chr> "Female", "Female", "Female", "Male", "Male",…
$ age                      <chr> "[0-10)", "[10-20)", "[20-30)", "[30-40)", "[…
$ weight                   <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", …
$ admission_type_id        <int> 6, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1, 2, 1, 1, 3, …
$ discharge_disposition_id <int> 25, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 6, 1,…
$ admission_source_id      <int> 1, 7, 7, 7, 7, 2, 2, 7, 4, 4, 7, 4, 7, 7, 2, …
$ time_in_hospital         <int> 1, 3, 2, 2, 1, 3, 4, 5, 13, 12, 9, 7, 7, 10, …
$ payer_code               <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", …
$ medical_specialty        <chr> "Pediatrics-Endocrinology", "?", "?", "?", "?…
$ num_lab_procedures       <int> 41, 59, 11, 44, 51, 31, 70, 73, 68, 33, 47, 6…
$ num_procedures           <int> 0, 0, 5, 1, 0, 6, 1, 0, 2, 3, 2, 0, 0, 1, 5, …
$ num_medications          <int> 1, 18, 13, 16, 8, 16, 21, 12, 28, 18, 17, 11,…
$ number_outpatient        <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ number_emergency         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ number_inpatient         <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ diag_1                   <chr> "250.83", "276", "648", "8", "197", "414", "4…
$ diag_2                   <chr> "?", "250.01", "250", "250.43", "157", "411",…
$ diag_3                   <chr> "?", "255", "V27", "403", "250", "250", "V45"…
$ number_diagnoses         <int> 1, 9, 6, 7, 5, 9, 7, 8, 8, 8, 9, 7, 8, 8, 8, …
$ max_glu_serum            <chr> "None", "None", "None", "None", "None", "None…
$ A1Cresult                <chr> "None", "None", "None", "None", "None", "None…
$ metformin                <chr> "No", "No", "No", "No", "No", "No", "Steady",…
$ repaglinide              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ nateglinide              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ chlorpropamide           <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ glimepiride              <chr> "No", "No", "No", "No", "No", "No", "Steady",…
$ acetohexamide            <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ glipizide                <chr> "No", "No", "Steady", "No", "Steady", "No", "…
$ glyburide                <chr> "No", "No", "No", "No", "No", "No", "No", "St…
$ tolbutamide              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ pioglitazone             <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ rosiglitazone            <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ acarbose                 <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ miglitol                 <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ troglitazone             <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ tolazamide               <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ examide                  <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ citoglipton              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ insulin                  <chr> "No", "Up", "No", "Up", "Steady", "Steady", "…
$ glyburide.metformin      <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ glipizide.metformin      <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ glimepiride.pioglitazone <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ metformin.rosiglitazone  <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ metformin.pioglitazone   <chr> "No", "No", "No", "No", "No", "No", "No", "No…
$ change                   <chr> "No", "Ch", "No", "Ch", "Ch", "No", "Ch", "No…
$ diabetesMed              <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes…
$ readmitted               <chr> "NO", ">30", "NO", "NO", "NO", ">30", "NO", "…
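
A minimal sketch of two preparation steps that the output above suggests: treating `?` as missing and collapsing `readmitted` into a yes/no outcome. The toy file stands in for the real data. The value `<30` is included on the assumption that the full data also record readmissions within 30 days, and the new column name `readmitted_any` is made up for illustration.

```r
# Write a tiny comma-separated stand-in for diabetic_data.csv (illustrative rows only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("weight,time_in_hospital,readmitted",
             "?,1,NO",
             "?,3,>30",
             "[75-100),2,<30"), tmp)

# na.strings = "?" turns the question marks into NA while reading
diabetic_ex <- read.csv(tmp, na.strings = "?")

# Count missing values per column
colSums(is.na(diabetic_ex))

# Collapse the outcome: everything other than "NO" counts as readmitted
diabetic_ex$readmitted_any <- factor(
  ifelse(diabetic_ex$readmitted == "NO", "No", "Yes"),
  levels = c("No", "Yes")
)
table(diabetic_ex$readmitted_any)
```

Whether to collapse the outcome this way or keep the distinction between early and late readmission depends on the question you want to answer.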

A.9 Absenteeism

This is a synthetic dataset containing 8,336 people in a fictitious company and the number of hours each person has been absent from work.

The data are available from Kaggle.

absenteeism <- read.csv("data/MFGEmployees4.csv")
glimpse(absenteeism)
Rows: 8,336
Columns: 13
$ EmployeeNumber <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ Surname        <chr> "Gutierrez", "Hardwick", "Delgado", "Simon", "Delvalle"…
$ GivenName      <chr> "Molly", "Stephen", "Chester", "Irene", "Edward", "Erni…
$ Gender         <chr> "F", "M", "M", "F", "M", "M", "M", "M", "M", "M", "M", …
$ City           <chr> "Burnaby", "Courtenay", "Richmond", "Victoria", "New We…
$ JobTitle       <chr> "Baker", "Baker", "Baker", "Baker", "Baker", "Baker", "…
$ DepartmentName <chr> "Bakery", "Bakery", "Bakery", "Bakery", "Bakery", "Bake…
$ StoreLocation  <chr> "Burnaby", "Nanaimo", "Richmond", "Victoria", "New West…
$ Division       <chr> "Stores", "Stores", "Stores", "Stores", "Stores", "Stor…
$ Age            <dbl> 32.02882, 40.32090, 48.82205, 44.59936, 35.69788, 48.44…
$ LengthService  <dbl> 6.018478, 5.532445, 4.389973, 3.081736, 3.619091, 2.717…
$ AbsentHours    <dbl> 36.57731, 30.16507, 83.80780, 70.02017, 0.00000, 81.830…
$ BusinessUnit   <chr> "Stores", "Stores", "Stores", "Stores", "Stores", "Stor…

A.10 Human resources (HR)

The data are available from Kaggle, and the variables are described in more detail at this link.

hr <- read.csv("data/HRDataset_v14.csv")
glimpse(hr)
Rows: 311
Columns: 36
$ Employee_Name              <chr> "Adinolfi, Wilson  K", "Ait Sidi, Karthikey…
$ EmpID                      <int> 10026, 10084, 10196, 10088, 10069, 10002, 1…
$ MarriedID                  <int> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0…
$ MaritalStatusID            <int> 0, 1, 1, 1, 2, 0, 0, 4, 0, 2, 1, 1, 2, 0, 2…
$ GenderID                   <int> 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1…
$ EmpStatusID                <int> 1, 5, 5, 1, 5, 1, 1, 1, 3, 1, 5, 5, 1, 1, 5…
$ DeptID                     <int> 5, 3, 5, 5, 5, 5, 4, 5, 5, 3, 5, 5, 3, 5, 5…
$ PerfScoreID                <int> 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3…
$ FromDiversityJobFairID     <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
$ Salary                     <int> 62506, 104437, 64955, 64991, 50825, 57568, …
$ Termd                      <int> 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1…
$ PositionID                 <int> 19, 27, 20, 19, 19, 19, 24, 19, 19, 14, 19,…
$ Position                   <chr> "Production Technician I", "Sr. DBA", "Prod…
$ State                      <chr> "MA", "MA", "MA", "MA", "MA", "MA", "MA", "…
$ Zip                        <int> 1960, 2148, 1810, 1886, 2169, 1844, 2110, 2…
$ DOB                        <chr> "07/10/83", "05/05/75", "09/19/88", "09/27/…
$ Sex                        <chr> "M ", "M ", "F", "F", "F", "F", "F", "M ", …
$ MaritalDesc                <chr> "Single", "Married", "Married", "Married", …
$ CitizenDesc                <chr> "US Citizen", "US Citizen", "US Citizen", "…
$ HispanicLatino             <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ RaceDesc                   <chr> "White", "White", "White", "White", "White"…
$ DateofHire                 <chr> "7/5/2011", "3/30/2015", "7/5/2011", "1/7/2…
$ DateofTermination          <chr> "", "6/16/2016", "9/24/2012", "", "9/6/2016…
$ TermReason                 <chr> "N/A-StillEmployed", "career change", "hour…
$ EmploymentStatus           <chr> "Active", "Voluntarily Terminated", "Volunt…
$ Department                 <chr> "Production       ", "IT/IS", "Production  …
$ ManagerName                <chr> "Michael Albert", "Simon Roup", "Kissy Sull…
$ ManagerID                  <int> 22, 4, 20, 16, 39, 11, 10, 19, 12, 7, 14, 2…
$ RecruitmentSource          <chr> "LinkedIn", "Indeed", "LinkedIn", "Indeed",…
$ PerformanceScore           <chr> "Exceeds", "Fully Meets", "Fully Meets", "F…
$ EngagementSurvey           <dbl> 4.60, 4.96, 3.02, 4.84, 5.00, 5.00, 3.04, 5…
$ EmpSatisfaction            <int> 5, 3, 3, 5, 4, 5, 3, 4, 3, 5, 4, 3, 4, 4, 5…
$ SpecialProjectsCount       <int> 0, 6, 0, 0, 0, 0, 4, 0, 0, 6, 0, 0, 5, 0, 0…
$ LastPerformanceReview_Date <chr> "1/17/2019", "2/24/2016", "5/15/2012", "1/3…
$ DaysLateLast30             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Absences                   <int> 1, 17, 3, 15, 2, 15, 19, 19, 4, 16, 12, 15,…

A.11 Network

load("data/networkExample.RData")
glimpse(dataset)
Rows: 926
Columns: 26
$ degree               <dbl> 0.006282723, 0.002094241, 0.002094241, 0.00104712…
$ betweenness          <dbl> 0.0081438885, 0.0020810695, 0.0014569424, 0.00000…
$ closeness            <dbl> 0.08535931, 0.08049562, 0.08226376, 0.07795282, 0…
$ transitivity         <dbl> 0.13333333, 0.00000000, 0.00000000, 0.00000000, 0…
$ triangles            <dbl> 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0…
$ ChurnNeighbors       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ NonChurnNeighbors    <dbl> 6, 2, 2, 1, 3, 5, 2, 2, 2, 6, 2, 3, 6, 2, 3, 2, 2…
$ Neighbors            <dbl> 6, 2, 2, 1, 3, 5, 2, 2, 3, 6, 2, 3, 6, 2, 3, 2, 2…
$ RelationalNeighbor   <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.000…
$ ChurnNeighbors2      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ NonChurnNeighbors2   <dbl> 18, 4, 8, 4, 5, 19, 6, 8, 11, 15, 7, 6, 27, 4, 6,…
$ RelationalNeighbor2  <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0…
$ degree2              <dbl> 0.026178010, 0.006282723, 0.010471204, 0.00523560…
$ averageDegree        <dbl> 0.004363002, 0.003141361, 0.005235602, 0.00523560…
$ averageDegree2       <dbl> 0.004188482, 0.004973822, 0.004581152, 0.00549738…
$ averageTransitivity  <dbl> 0.13888889, 0.05000000, 0.03333333, 0.10000000, 0…
$ averageTransitivity2 <dbl> 0.11415344, 0.10833333, 0.22777778, 0.18511905, 0…
$ averageBetweenness   <dbl> 0.005713676, 0.004259980, 0.008147263, 0.00623771…
$ averageBetweenness2  <dbl> 0.006733850, 0.008557955, 0.007690396, 0.00625752…
$ averageTriangles     <dbl> 0.8333333, 0.5000000, 0.5000000, 1.0000000, 0.000…
$ averageTriangles2    <dbl> 0.7777778, 1.2500000, 0.7500000, 1.7500000, 0.400…
$ pr_0.85              <dbl> 0.0016432968, 0.0008315249, 0.0006479747, 0.00040…
$ pr_0.20              <dbl> 0.0011679051, 0.0010706518, 0.0009325680, 0.00088…
$ perspr_0.85          <dbl> 0.0016432968, 0.0008315249, 0.0006479747, 0.00040…
$ perspr_0.99          <dbl> 0.0017826047, 0.0006187399, 0.0006012571, 0.00030…
$ Future               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

A.12 Occupational wage data

oes <- readRDS("data/oes.rds")
class(oes)
[1] "matrix" "array" 
glimpse(oes)
 num [1:22, 1:15] 70800 50580 60350 56330 49710 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:22] "Management" "Business Operations" "Computer Science" "Architecture/Engineering" ...
  ..$ : chr [1:15] "2001" "2002" "2003" "2004" ...
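
Because `oes` is a matrix rather than a data frame, it may need converting before it can be used with functions that expect a data frame. The sketch below shows one way to do that on a small matrix with the same layout (occupations in rows, years in columns). The 2001 values are those visible above, the 2002 values are invented, and the object and column names (`oes_ex`, `oes_wide`, `oes_long`, `occupation`, `year`, `wage`) are illustrative.

```r
# Small stand-in matrix with the same layout as oes
oes_ex <- matrix(
  c(70800, 50580, 60350,   # 2001 (values as shown above)
    72000, 51000, 61000),  # 2002 (invented for illustration)
  nrow = 3,
  dimnames = list(c("Management", "Business Operations", "Computer Science"),
                  c("2001", "2002"))
)

# Wide data frame: one row per occupation, one column per year
oes_wide <- as.data.frame(oes_ex)
oes_wide$occupation <- rownames(oes_ex)

# Long data frame: one row per combination of occupation and year
oes_long <- as.data.frame(as.table(oes_ex), responseName = "wage",
                          stringsAsFactors = FALSE)
names(oes_long)[1:2] <- c("occupation", "year")

head(oes_long)
```

The wide form keeps the original layout, while the long form is usually more convenient for plotting.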

A.13 Voters

The data are taken from the 2016 Views of the Electorate Research Survey conducted by the Voter Study Group. The full variable list has been uploaded to Canvas, where it says

A relevant task is to predict which voters support Clinton. Such a classification can be used, for example, to target messaging. A related task is to cluster the voters in order to find segments.

voters <- read.csv("data/voters.csv")
glimpse(voters)
Rows: 6,426
Columns: 42
$ RIGGED_SYSTEM_1_2016 <int> 3, 2, 2, 1, 3, 3, 3, 2, 4, 2, 3, 3, 4, 4, 3, 3, 2…
$ RIGGED_SYSTEM_2_2016 <int> 4, 1, 4, 4, 1, 3, 4, 3, 4, 3, 2, 2, 3, 2, 4, 3, 2…
$ RIGGED_SYSTEM_3_2016 <int> 1, 3, 1, 1, 3, 2, 1, 3, 1, 1, 1, 4, 1, 1, 1, 1, 3…
$ RIGGED_SYSTEM_4_2016 <int> 4, 1, 4, 4, 1, 2, 1, 2, 3, 2, 4, 1, 3, 4, 2, 2, 1…
$ RIGGED_SYSTEM_5_2016 <int> 3, 3, 1, 2, 3, 2, 2, 1, 3, 2, 2, 2, 3, 3, 2, 3, 2…
$ RIGGED_SYSTEM_6_2016 <int> 2, 2, 1, 1, 2, 3, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2…
$ track_2016           <int> 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 3, 2, 2, 2…
$ persfinretro_2016    <int> 2, 3, 3, 1, 2, 2, 2, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2…
$ econtrend_2016       <int> 1, 3, 3, 1, 2, 2, 1, 3, 1, 1, 1, 3, 2, 1, 4, 3, 2…
$ Americatrend_2016    <int> 1, 1, 1, 3, 3, 1, 2, 3, 2, 1, 3, 3, 2, 1, 1, 3, 1…
$ futuretrend_2016     <int> 4, 1, 1, 3, 4, 3, 1, 3, 1, 1, 3, 1, 1, 4, 3, 4, 3…
$ wealth_2016          <int> 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1…
$ values_culture_2016  <int> 2, 3, 3, 3, 3, 2, 3, 3, 1, 3, 3, 2, 1, 1, 3, 8, 3…
$ US_respect_2016      <int> 2, 3, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3…
$ trustgovt_2016       <int> 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3…
$ trust_people_2016    <int> 8, 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 8, 8, 2, 2…
$ helpful_people_2016  <int> 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 8, 1, 1…
$ fair_people_2016     <int> 8, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2, 8, 2, 1…
$ imiss_a_2016         <int> 2, 1, 1, 1, 1, 2, 1, 1, 3, 1, 1, 1, 2, 1, 2, 2, 2…
$ imiss_b_2016         <int> 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1…
$ imiss_c_2016         <int> 1, 2, 2, 3, 1, 2, 2, 1, 4, 2, 3, 1, 2, 2, 3, 1, 1…
$ imiss_d_2016         <int> 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 2, 1, 1, 1, 3…
$ imiss_e_2016         <int> 1, 1, 3, 1, 1, 3, 1, 2, 1, 1, 2, 2, 4, 1, 4, 2, 1…
$ imiss_f_2016         <int> 2, 1, 1, 2, 1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 3, 2, 2…
$ imiss_g_2016         <int> 1, 4, 3, 3, 3, 1, 3, 4, 2, 2, 1, 4, 1, 2, 1, 1, 4…
$ imiss_h_2016         <int> 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 3…
$ imiss_i_2016         <int> 2, 2, 4, 4, 2, 1, 1, 3, 2, 1, 1, 2, 1, 2, 2, 2, 3…
$ imiss_j_2016         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
$ imiss_k_2016         <int> 1, 2, 1, 1, 2, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1, 1…
$ imiss_l_2016         <int> 1, 4, 1, 2, 4, 1, 1, 3, 1, 1, 1, 4, 2, 1, 1, 1, 3…
$ imiss_m_2016         <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1…
$ imiss_n_2016         <int> 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1…
$ imiss_o_2016         <int> 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1…
$ imiss_p_2016         <int> 2, 1, 2, 3, 1, 3, 1, 1, 4, 1, 1, 1, 2, 3, 2, 3, 1…
$ imiss_q_2016         <int> 1, 1, 1, 2, 2, 1, 1, 4, 2, 1, 1, 3, 1, 1, 2, 2, 3…
$ imiss_r_2016         <int> 2, 1, 1, 2, 1, 2, 1, 2, 4, 2, 2, 1, 3, 2, 2, 2, 1…
$ imiss_s_2016         <int> 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 3…
$ imiss_t_2016         <int> 1, 1, 3, 3, 1, 1, 3, 4, 1, 1, 1, 3, 1, 3, 1, 1, 3…
$ imiss_u_2016         <int> 2, 2, 2, 2, 1, 3, 3, 1, 4, 2, 3, 2, 4, 3, 3, 3, 1…
$ imiss_x_2016         <int> 1, 3, 1, 2, 1, 1, 1, 4, 1, 1, 1, 2, 1, 1, 1, 2, 3…
$ imiss_y_2016         <int> 1, 4, 2, 3, 1, 1, 1, 3, 2, 1, 1, 3, 1, 1, 1, 2, 2…
$ Clinton_supp         <chr> "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No…
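
The two tasks described above, classification and clustering, can be sketched as follows. The data are simulated so that the code runs on its own; the item names and the object names are made up. Logistic regression and k-means with three clusters are arbitrary choices for illustration, not necessarily the methods you should use.

```r
set.seed(1)

# Simulated stand-in for the survey: 200 respondents, 5 items scored 1-4
items <- as.data.frame(matrix(sample(1:4, 200 * 5, replace = TRUE), ncol = 5))
names(items) <- paste0("item_", 1:5)
voters_ex <- cbind(items,
                   Clinton_supp = sample(c("Yes", "No"), 200, replace = TRUE))

# Classification: make the outcome a factor with "No" as the reference level
voters_ex$Clinton_supp <- factor(voters_ex$Clinton_supp, levels = c("No", "Yes"))
fit <- glm(Clinton_supp ~ ., data = voters_ex, family = binomial)
pred <- predict(fit, type = "response")   # predicted probability of "Yes"

# Clustering: k-means on the standardized items only, leaving out the outcome
km <- kmeans(scale(items), centers = 3, nstart = 10)
table(km$cluster)
```

With the real data you would replace `voters_ex` with `voters` and decide how to handle the survey's response codes before fitting anything.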

  1. If you open such a file in Notepad or another text editor, you will see what it looks like.